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Abstract 



Background: Aligning protein structures is a highly relevant task. It enables the 
study of functional and ancestry relationships between proteins and is very important 
for homology and threading methods in structure prediction. Existing methods typically 
only partially explore the space of possible alignments and being able to efficiently handle 
permutations efficiently is rare. 

Results: A novel approach for structure alignment is presented, where the key ingre- 
dients are: (1) An error function formulation of the problem simultaneously in terms of 
binary (Potts) assignment variables and real-valued atomic coordinates. (2) Minimiza- 
tion of the error function by an iterative method, where in each iteration a mean field 
method is employed for the assignment variables and exact rotation/translation of atomic 
coordinates is performed, weighted with the corresponding assignment variables. The 
approach allows for extensive search of all possible alignments, including those involving 
arbitrary permutations. The algorithm is implemented using a (^-representation of the 
backbone and explored on different protein structure categories using the Protein Data 
Bank (Pdb) and is successfully compared with other algorithms. 

Conclusions: The approach performs very well with modest CPU consumption and 
is robust with respect to choice of parameters. It is extremely generic and flexible and 
can handle additional user-prescribed constraints easily. Furthermore, it allows for a 
probabilistic interpretation of the results. 
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Introduction 



Aligning protein structures is a subject of utmost relevance. It enables the study of func- 
tional relationship between proteins and is very important for homology and threading 
methods in structure prediction. Furthermore, by grouping protein structures into fold 
families and subsequent tree reconstruction, ancestry and evolutionary issues may get 
unraveled. 

Structure alignment amounts to matching two 3D structures such that potential common 
substructures, e.g. a-helices, have priority. The latter is accomplished by allowing for 
gaps in either of the chains. Also, the possibility of permuting sites within a chain 
may be beneficial. At first sight, the problem may appear very similar to sequence 
alignment, as manifested in some of the vocabulary (gap costs etc.). However, from an 
algorithmic standpoint there is a major difference. Whereas sequence alignment can be 
solved within polynomial time using dynamical programming methods U , this is not the 
case for structure alignment since rigid bodies are to be matched. Hence, for all structure 
alignment algorithms the scope is limited to high quality approximate solutions. 

Existing methods for structure alignment fall into two broad classes, depending upon 
whether one (1) directly minimizes the inter-atomic distances between two structures or 
(2) minimizes the distance between substructures that are either preselected or supplied 
by an algorithm involving intra-atomic distances. 

One approach within the first category is the iterative dynamical programming method 
0, [|, where one first computes a distance matrix between all pairs of atoms (e.g. C a ) 
forming a similarity matrix, which by dynamical programming methods gives rise to an 
assignment matrix mimicking the sequence alignment procedure. One of the chains is 
then moved towards the other by minimizing the distance between assigned pairs. This 
method does not allow for permutations. Another inter-atomic approach is pursued 
in ||, where the area rather than distances between two structures is minimized. 

In || the approach is different. Here one compares distance matrices within each of 
the two structures to be aligned, which provide information about similar substructures. 
The latter are subsequently matched. A similar framework is used in || and also in 0. 
Not surprisingly, in [[5], ^| and permutations can in principle be dealt with. 

There are implementation issues shared by both methodologies above. One is structure 
encoding (C a and/or Cp of the chains). For many comparisons C a appears to be suffi- 
cient, whereas in some cases Cp is needed. Also, the choice of distance metric is a subject 
of concern in order to avoid the influence of outliers. 

The iterative dynamical programming method |§ has been extensively assessed for back- 
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bone structures || from the Scop database, in which protein structures have been 
classified by visual inspection. Some comparisons with Scop have also been performed 
PH| | using the method in |J. 

Here we present a novel approach, which shares some of its philosophy from the iterative 
dynamical programming method || . Its key ingredients are: (A) An error function for- 
mulation of the problem simultaneously in terms of binary (Potts) assignment variables 
and real-valued atomic coordinates and (B) minimization of the error function by an 
iterative method, where each iteration contains two steps: 

1. A mean field procedure for minimizing with respect to the assignment variables. 

2. Exact rotation and translation of atomic coordinates weighted with the correspond- 
ing assignment variables. 

The approach, which is very general, has some very appealing properties: 

• Implicit complete exploration of the entire space of alignments, which allows for 
arbitrary permutations. To our knowledge, no other approach has this feature. 

• Probabilistic interpretation of the results. This feature is present without tedious 
Monte Carlo estimates since the algorithm is deterministic. Among other things, 
this implies that the approach is less sensitive to the choice of distance metric, since 
the distances are weighted with fuzzy numbers. 

• With its generality, almost arbitrary additional constraints are easily incorporated 
into the formalism including different functional forms of gap penalties. 

The approach is tested using C a -representation of backbones, by comparing the results 
with the approaches of || and || as implemented in the Yale Alignment Server and 
Dali respectively and in one instance also with || (Entrez) . In choosing protein pairs 
to align we followed || to a large extent. In [|J pairs with marginal sequence overlap but 
where each protein in a pair belongs to the same SCOP superfamily and therefore have 
a similar structure were picked for assessment. We selected pairs from a varied selection 
of the families used in || to test our algorithm: 

• Dihydrofolate Reductases (a//3) 

• Globins (all-a) 

• Plastocyanin/azurin (all-/?) 

• Immunoglobulins (all-/5) 

In addition, we test the permutation capacity of our approach, by aligning: 
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• Permuted proteins (winged helix fold) 



When assessing the algorithm, we limit ourselves to a core version, where Cp degrees 
of freedom are not included. Also, no post-processing of the results is done. We defer 
such elaborations and others to forthcoming publication. Nevertheless, the core version 
of our approach is already very competitive even for chains, where permutations are 
not needed. For the latter case, the other algorithms could not be tested using the 
corresponding WWW-servers. In the instances, where we have tested it for this kind of 
problems, it also performs well. 

The algorithm is implemented in C++. Given its generality and power, the CPU demand 
is quite modest - it scales like the chain lengths squared and on the average requires a 
few seconds on a Pentium 400MHz PC. 



Methods 
The Algorithm 

In what follows we have two proteins with Aq and N 2 atoms to be structurally aligned. 
This is accomplished by a series of weighted rigid body transformations of the first chain, 
keeping the second chain fixed. We denote by Xj (i = 1, Aq) and (j = 1, N 2 ) the 
atom coordinates of the first and second chain, respectively. The phrase " atom" will be 
used throughout this paper in a generic sense - it could represent individual atoms but 
also groups of atoms. In our applications it will mean C Q -atoms along the backbone. A 
square distance metric between the chain atoms is used, 



but the formalism is not confined to this choice. 

We start by discussing the encodings and error function and then we present a method 
for minimizing the latter. 

The Gapless Case. For pedagogical reasons, we start off with the gapless case with 
Aq — N 2 . We define binary assignment variables Sy such that s^- = 1 if atom % in one 
chain matches j in the other and s^- = otherwise. Since every atom in one chain must 
match one atom in the other, the following conditions must be fulfilled: 



4 = |xi - yj 



(1) 



Y,Sij = l j = l,...,N 2 



(2) 



i=i 



3 



N 2 



iVi 



(3) 



A suitable error function to minimize subject to the above constraints (Eqs. (Q,|3|)) is 



E. 



clia 



i=l j=l 



(4) 



where the spatial degrees of freedom, Xj, are contained in the distance matrix d?.. Thus 
whenever Sy=l one adds a penalty d|- to E chain . Note that Eq. (f|) is to be minimized 
both with respect to the binary variables and the real- valued coordinates x$. 

The Gapped Case. Allowing for gaps in either of the chains is implemented by extend- 
ing Sij to include O-components in a compact way; s i0 = 1 and s j = 1 if an atom (i or j) 
in one chain is matched with a gap in the other and vice versa. Hence, gap positions 
are not represented by individual elements in Sy-; rather the gap-elements correspond to 



common sinks. 
Eq. (§). 



The matrix S, with elements s,,-, containing gap-elements is shown in 



S 



SlO 
■§20 



Sol S 02 



Sll 
S21 



Sl2 
S22 



S0iV 2 \ 



S2iY 2 



(5) 



\ SjViO S A r il s JVi2 •••• s 7Vi7V 2 / 



Some caution is needed when generalizing Eqs. (H|f) to host gaps, since the elements of 
the first row and column (gap-mappings containing the index 0) in Eq. @ differ from 
the others in that they need not sum up to 1. Hence Eqs. ([|JD becomes 



£' 

i=0 
N 2 

£■ 

3=0 



1; j = l,...,iV2 
1; i = 1, . . . , iVi 



where the first condition can be rewritten as 



0; j 



i=l 



i=l 



(6) 



(7) 



The encoding (sy) of matches and gaps is illustrated in Fig. |l]with a simple example. 
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Figure 1 : A simple example of the assignment matrix S (right) corresponding to the matching of the 
two toy chains (left). 



Assuming a constant penalty per inserted gap one has the error function 

Nl 1 N 2 

E = -Echain + Aj SjQ + Xj SQj 



(8) 



t=i 



where A^is the cost for matching atom % in the first chain with a gap in the second 



chain, and similarly for A!- . The position dependence of the gap costs, A^ ±; and A^ 
originates from the fact that it is desirable not to break a-helix and /3-strand structures. 



(i) 



,(2) 



In Eq. (H) the gap penalties are proportional to gap lengths. In sequence alignment it 
is conjectured that gap penalties consist of two parts; a penalty for opening a gap and 
then a penalty proportional to the gap length. As in ||, we will for structure alignment 
here adopt the same gap cost philosophy, i.e. A| and A^ 2 for opening a gap and a 
position-independent 5 per consecutive gap. Hence, Eq. (BJ) generalizes to 



E 



Ni N 2 

-^chain + Aj SjQ + Xj S()j 



i=l 



3=1 



Nj N 2 

+ ( S - *i) s i-i,oS i0 + ( S - X ?) s o,j-i s oj 



(9) 



i=2 



where products like Sj are 1 if two adjacent atoms are matched to gaps. 



Minimization. Next we need an efficient procedure for minimizing E with respect 
to both Sij and Xj subject to the constraints in Eqs. (§],|7|)- As mentioned above, this 
minimization problem is non-trivial due to the rigid body constraint. A similar problem 



in terms of fitting structures with relevance factors was probed in for track finding 
problems with a template approach using the mean field approximation. Here we will 
adopt a similar approach. 



5 



In our formulation, the inherent optimization difficulty resides in the binary part (sy) 
of the problem. Hence, minimizing Eq. ([5J) using a simple updating rule for will very 
likely yield poor solutions due to local minima. Well known stochastic procedures such 
as simulated annealing (SA) fl2| for avoiding this are too costly from a computational 
standpoint. In the mean field (MF) approach the philosophy behind SA is retained, 



but the tedious simulations are replaced by an efficient deterministic process. The bi- 
nary variables Sij are then replaced by continuous mean field variables Vij G [0, 1], with 
a dynamics given by iteratively solving the MF equations for a decreasing set of temper- 
atures T down to To, where most of the approach either 1 or 0. These continuous MF 
variables can evolve in a space not accessible to the original intermediate variables. The 
intermediate configurations at non-zero T have a natural probabilistic interpretation. 

For satisfying Eq. (0), the MF equations for the corresponding read 

e u i: j/T 

«v = ; i=l,-,N 1 (10) 



where the force it™ is given by 



N 2 
k=0 



dE . . 

Uij = —- — (11) 



dvij 



and is computed by substituting s^- with in E (Eq. @). Note that the desired 
normalization condition, Eq. (^), 

N 2 

XX- = i; i = i,...,JVi (12) 

3=0 

is fulfilled automatically in Eq. (pTOf) . The other condition (Eq. ([?])) is enforced by adding 
a penalty term 

N 2 Ni Nx 

e, = 7E[(E%)(E^-!)] 

j=l 1=1 k=l 

JVi Ni N 2 

= tEEE%% (13) 

i=l kj^i j=l 

where 7 is a parameter and the last equality follows from the fact that vf- = for T=0. 

So far we have only looked at the assignment part when minimizing the error function. 
When updating the mean field variables f^, using the MF equations, the distance mea- 
sure dfj is a fixed quantity. This corresponds to having the chains at fixed positions. 
However, we also want to minimize the distance between the two chains. Based on the 
probabilistic nature of the mean field variables we propose to update the chain positions 
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using the (fuzzy) assignment matrix V, with elements Vij. This is done simultaneously 
with the updating of v^. Explicitly, one of the chains will be moved in order to minimize 
the chain error function -E chain (Eq. (M). 

The distance measure d?- depends on the translation vector a and the rotation matrix TZ, 
making a total of six independent variables. Let be the coordinates of the translated 
and rotated protein, i.e. x • = a + T&q, then 

^chain = Yj % ( a + ~yjf ( 14 ) 

i=l j=l 

This minimization problem can be solved exactly with closed-form expressions for TZ and 
a that minimizes £? chain fll4| . It should be noted that this solution is rotationally invariant 



(independent of TZ) for the special case when the atoms in the two chains matches each 
other with the same weight, i.e. when = constant for all i and j, which is the case 
for high T. 

In summary, for a decreasing set of temperatures T, one iterates until convergence: 

1. The MF equations (Eq. (0)). 

2. Exact translation and rotation of the chain (Eq. (|HP). 

We stress again that step 2 is done with the fuzzy MF assignment variables and not 
with the binary ones, s^. After convergence, i>y are rounded off to or 1 and rms (root- 
mean-square-distance) is computed for the matching pairs. Algorithmic details can be 
found in the next subsection. 

The forces it^ entering Eq. fllOf) are proportional to d?- (Eqs. (1,0))- It is the ratio dfj/T 
that counts. Hence, for large temperatures T, are fairly insensitive to dij and many 
potential matching pairs contribute fairly evenly. As the temperature is decreased, 
a few pairs (the ones with small d^) are singled out and finally at the lowest T only 
one winner remains. One can view the situation as that around each atom i one has a 
Gaussian domain of attraction, which initially (large T) has a large width, but gradually 
shrinks to a small finite value. 

The fuzziness of the approach is illustrated in Fig. |2], where the evolution of uy, as T is 
lowered, is shown for parts of the first helices of 1ECD and 1MBD (see next section) 
together with snap-shots of the corresponding chain sections. At high T all v ^ are similar; 
all potential matches have equal probability. At lower T, several have approached 
or 1 and the movable chain is moving in the right direction. At yet lower T, note 
that a few converge later than the majority. These are in this example related to the 
matching of the last atom in one of the chains. This atom has two potential candidates 
to match resulting in a number of that converge last. 
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(c) (d) 




Figure 2: Illustration of the fuzziness of the approach. The alignment shown is for 10 atoms in the first 
helices in the proteins 1ECD (blue) and 1MBD (red), (a) Evolution of all the 120 Vij as a function of 
iteration time r (T is lowered with r). (b) Positions of the atoms at r = 1. For high T every atom in 
a protein feels all the atoms in the other protein and the problem is rotationally invariant, (c) r = 12; 
most of the relevant matchings are forcing the system to move in the right direction, (d) r = 50; the 
final assignments are done. The different snapshots are presented using different projections. Some v^j 
approach or 1 rather late and they are coloured green. These i>y are related to the atom at the end 
of the 1MBD segment, which also is coloured green, and as can be seen in (c) the difficulty is whether 
to align this atom to the last or second last atom in the 1ECD segment. 



Implementation 



Here we give a very condensed, but yet self-contained and detailed description of the 
algorithm and the parameters involved, such that the results of this paper are repro- 
ducible. 
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Parameters. Two kind of parameters are used; the ones related to encoding of the 
problem (7) and iteration dynamics (e), where e governs the annealing schedule (see 
below), and the ones specifying gap costs (A, 5). The same set of parameters can be 
used for most of the pairs (see Table |l|); the algorithm is remarkably stable. 



Protein Family 


e 


7 


A 


^shcct 


^helix 


S 


ct//3, all- a 


0.8 


0.065 


0.10 


1.5A 


1.5A 


A/2 


Plastocyanin/azurin 


0.8 


0.035 


0.10 


2.0A 


2.0A 


A/5 


Immunoglobulins 


0.8 


0.040 


0.15 


2.0A 


2.0A 


A/5 


Winged helix fold 


0.8 


0.070 


0.20 


2.0A 


2.0A 


A/5 



Table 1: Parameters used in the algorithm. The first family involves 27 pairs, whereas the others one 
each. 

Initialization. An initialization of the chains is made prior to the mean field alignment. 
First both chains are moved to their common center of mass. For the random initializa- 
tion, this move is then followed by a random rotation of one of the chains. Most of the 
times, however, a sequential initialization is used that consists of minimizing Eq. @ us- 
ing a band-diagonal assignment matrix S. This corresponds to a situation where, on the 
average, atom i in one of the chains is matched to atom i in the other. If not explicitly 
mentioned, sequential initialization is used for all the protein pairs in this paper. 

Iteration Steps. The shortest chain is always chosen as the one that is moved (xj). The 
mean field variables are updated according to Eq. ( |T0| ) where, in order to improve 
convergence, the derivatives in Eq. ( ]TT| ) are replaced by finite differences (see e.g. [Pp. 
This update equation accounts for all mean field variables except for the first row of V, 
which is updated according to 

iVi 

w oj = l-X^i' j = 1, iV 2 (15) 
i=i 

The algorithmic steps are shown in Fig. [3[ After convergence, no post processing is 
applied for the results in the next section. 



Results 

To test the quality of our alignment algorithm, we have compared alignments of protein 
pairs with results from other automatic procedures. For most of the tested pairs, each 
protein belongs to the same Scop superfamily. The goal here is not a full investigation 
of all families but rather to explore a limited set with representative variation. Pairs were 
picked from a selection of the families investigated in |§ . Our choice of pairs is essentially 
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1. Initialization. 

2. Rescale coordinates such that the largest distance between atoms within the 
chains is unity. 

3. Initiate all Vij close to 1/ max(7Vi, AT 2 ) (randomly). 

4. Initiate the temperature (e.g. T = 2). 

5. Randomly (without replacement) select one row, say row k. 

6. Update all Vkj, j = 0, N 2 according to Eq. (|l~0|). 

7. Repeat items 5 — 6 N\ times (such that all rows have been updated once). 

8. Repeat items 5 — 7 until no changes occur 
(defined e.g. by l/(JViJVa) Ey K ~ 4f d) \ < 0.0001). 

9. Rotation and translation of the shortest chain using the fuzzy assignment matrix 
V. 

10. Decrease the temperature, T — > eT. 

11. Repeat items 5 — 10 until all wy are close to 1 or 
(defined e.g. by 1/N X £V > 0.99). 

12. Finally, the mean field solution is given by the integer limit of Vij, i.e. 

for each row i, i = l,...,Ni select the column j* such that Vij* is the largest 
element for this row. Let s«» = 1 and all other su = for this row. 



Figure 3: Algorithmic steps. 



based on two criteria. First, the pairs should have diverse structures, and in particular 
include all-a, all-/?, and a/ j3 proteins. Second, in || some families are considered to be 
very easy, easy and difficult to align, respectively, and we included pairs from all these 
categories. In addition we have tested the algorithm on cases where permutations are 
needed. 

Our results are compared with the Yale Alignment Server (http: / /bioinfo.mbb.yale. 
edu/align/) and Dali (http://www2.ebi.ac.uk/dali/). The Yale server applies post 
processing to its alignments by removing aligned pairs with too large root-mean-square- 
distance (rms) in an iterative manner subject to a termination criteria. A similar proce- 
dure is of course possible in our approach, but we have chosen at this stage to keep the 
algorithm clean. In the comparisons below we have for the Yale server quoted results 
both before and after the post-processing. 



Unless otherwise stated, proteins are in what follows denoted by their Pdb [W\ identifier, 
and in the case of chains or parts of chains with their Scop domain label. A summary 
of the results in terms of rms and the number of aligned atoms (N) is shown in Table |2| 
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Figure 4: rms and N corresponding to Table |^. The Yale data correspond to no post processing (see 
text). 



and in Fig. ||. Detailed comments upon these results and some additional ones can be 
found below. 

With regard to the general performance one must keep in mind that it is not straight- 
forward to assess alignment algorithms in terms of e.g. rms and N, since there are no 
obvious figure-of-merits. It is interesting to notice though that when inspecting aligned 
core regions in detail, we are close to the Yale alignments but in general with a lower 
rms. However, in such comparisons, we differ more from DALI. The YALE algorithm 
has been subject to comparison with Scop classifications using a multiple alignment 
procedure ||, giving its and our alignments a higher credibility. 

Dihydrofolate Reductases {a/ (3). These proteins belong to the Scop class a/ (3, which 
contains a- and /3-proteins that have mainly parallel beta sheets. They are considered 
very easy to align ||. If we compare alignments of core structure parts using the three 
methods we find that they all essentially agree. However, one notes that the Yale results 
are very sensitive to the post processing. 

Globins (all-a). In the all-a Scop class we particularly study a set of globin proteins. 
In general, we get lower rms than the other algorithms for the same number of aligned 
residues. When comparing alignments from the three algorithms we find that an im- 
portant aspect of our algorithm is manifested - allowing for permutation of individual 
atoms. The reason for this is that to optimally align secondary structures it is often 
beneficial to have a few permuted residues in loops between the secondary structures. 
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Protein family 


Protein Pair 


Yale 


Dali 


Lund 








rms 


N 


rms 


N 


rms 


N 


Dihydrofolate 


1DHFA 


- 8DFR 


1.7 


(0.7) 


182 


(182) 


0.7 


182 


0.7 


182 


Reductases 


IDHFa 


- 4DFRA 


2.7 


(1.2) 


155 


(130) 


2.0 


155 


1.9 


154 




IDHFa 


- 3DFR 


2.5 


(1.2) 


159 


(143) 


1.7 


158 


1.7 


159 




8DFR 


- 4DFRA 


2.8 


(1.3) 


156 


(131) 


2.1 


151 


1.9 


154 




8DFR 


- 3DFR 


2.6 


(1.3) 


160 


(146) 


2.0 


160 


1.7 


160 




4DFRA 


- 3DFR 


2.4 


(1.1) 


157 


(140) 


1.5 


152 


1.5 


153 


Globins 


2HHBA 


- 2HHBB 


2.3 


1.2 ) 


139 


(129) 


1.5 


139 


1.4 


139 




2HHBA 


- 2LHB 


2.7 


(1.6) 


131 


(123) 


1.8 


128 


1.9 


130 




2HHBA 


- 1MBD 


2.4 


(1.5) 


141 


(138) 


1.5 


139 


1.5 


141 




2HHBA 


- 2HBG 


2.4 


(0.8) 


138 


(105) 


1.7 


138 


1.6 


137 




2HHBA 


- 1MB A 


2.9 


(2.2) 


138 


(134) 


2.3 


136 


2.2 


138 




2HHBA 


- 1ECD 


3.1 


(2.2) 


130 


(126) 


2.3 


129 


2.2 


130 




2HHBB 


- 2LHB 


2.5 


(1.3) 


136 


(126) 


1.7 


134 


1.6 


134 




2HHBB 


- 1MBD 


2.3 


(1.4) 


145 


(138) 


1.6 


145 


1.4 


143 




2HHBB 


- 2HBG 


2.4 


(1.4) 


136 


(125) 


2.0 


135 


1.6 


133 




2HHBB 


- 1MBA 


3.0 


(2.2) 


140 


(137) 


2.3 


138 


2.2 


139 




2HHBB 


- 1ECD 


2.8 


(2.2) 


136 


(134) 


2.3 


129 


2.1 


134 




2LHB 


- 1MBD 


2.4 


(1.0) 


137 


(121) 


1.4 


135 


1.4 


136 




2LHB 


- 2HBG 


2.7 


(1.5) 


131 


(119) 


2.0 


128 


2.1 


130 




2LHB 


- 1MBA 


2.7 


(1.8) 


138 


(130) 


1.9 


135 


1.9 


132 




2LHB 


- 1ECD 


2.7 


(1.9) 


130 


(127) 


2.0 


128 


1.9 


128 




1MBD 


- 2HBG 


2.5 


(1.6) 


139 


(130) 


2.1 


139 


1.8 


137 




1MBD 


- 1MBA 


2.5 


(1.7) 


143 


(137) 


1.9 


142 


1.8 


142 




1MBD 


- 1ECD 


2.2 


(1.6) 


136 


(134) 


1.9 


136 


1.6 


136 




2HBG 


- 1MB A 


2.9 


(2.2) 


139 


(136) 


2.4 


137 


2.2 


135 




2HBG 


- 1ECD 


3.3 


(2.5) 


128 


(125) 


2.6 


129 


2.4 


125 




1MBA 


- 1ECD 


2.8 


(1.7) 


134 


(125) 


1.9 


133 


1.9 


135 


Plastocyanin/azurin 


1PLC 


- 1AZU 


4.7 (2.9) 


91 


(85) 


2.6 


86 


2.1 


78 


Immunoglobulins 


7FABl2 


- 1REIA 


3.5 


(2.6) 


83 


(79) 


2.6 


78 


3.0 


89 



Table 2: The root- mean-square-distance (rms) and the number of aligned residues (N) from the align- 
ment of different protein pairs. The results are presented for several automatic alignment procedures; 
Lund refers to this work. For Yale the numbers within parenthesis refer to after post processing (see 
text). 



If we again compare the core parts of the alignments from the three algorithms we find 
that they agree on a large fraction of the parts. 

Plastocyanin/azurin (all-/?). A11-/5 proteins are difficult to align if one only takes 
backbone coordinates {C a or Cp) into account, even though using Cp instead of C a 
coordinates, in general, improves the results. As an initial example of all-/? proteins we 
have looked at plastocyanin versus azurin. Even though this alignment is slightly more 
difficult than the previous cases, all three methods give similar rms and N and they all 
agree on the alignment of a majority of the core parts. For this example several restarts 
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were performed with random initialization. 

Immunoglobulins (all- 0). A more difficult example of all-/? proteins is immunoglob- 
ulins. We align the domain 7FABl2 with the chain IREIa and find that we can find 
alignments with low rms that look good. However, if we investigate the alignment in 
detail we find that atoms in all core regions, except one, are misaligned. This is also the 
case in ||, where the same alignment is investigated. To get the core regions correctly 
aligned in || they improve their method and take side chain orientation into account. 
We expect that this is the case for our method too. When aligning strands using only 
C a coordinates, strands in the two proteins are often matched satisfyingly to one an- 
other while the individual atoms are aligned such that one strand is translated with 
respect to the other. It is therefore obvious that side chain orientation is very important 
when aligning strands. For this example several restarts were performed with random 
initialization. 

Permuted proteins — winged helix fold. Finally we look at permuted versions of 
similar folds. We compare two DNA binding domains related to transcription regulation. 
The compared domains both have the winged helix fold but one of them has the secondary 
structures in a circularly permuted order. This is a case where iterative dynamical 
programming algorithms will fail. We look at ILEA and compare it to the Entrez- 
Mmdb JL7! structural domain 4 in chain B of 1XGN. This part of 1XGN is classified as 
a circularly permuted winged helix fold in Scop. In the Entrez-Mmdb database, which 
uses Vast (http://www.ncbi.nlm.nih.gov/Structure/VAST/) for alignments, 1XGNb4 
is listed as a low priority structural neighbour to ILEA, even though Vast does not 
allow for permutations of secondary structure. If one looks at the actual alignment one 
finds that the permuted secondary structures are not aligned. In Figure [5] we compare our 
alignment with Vast. We show the sequential parts of our alignment and in particular 
all parts with secondary structure are shown. Vast aligns only 39 residues in this 
comparison, while we align 60. We note that we get all the 39 of the aligned residues 
of Vast but that we in addition align the sheet at the end of ILEA with the sheet at 
the beginning of the domain in IXGNb. This demonstrates the importance of having 
a procedure that takes permutations into account, which our method does. Otherwise, 
important similarities between protein structures will not be found. For this example 
several restarts were performed with random initialization. 



Discussion 

A new approach to structure alignment has been presented and explored. It is based 
upon an error function encoding in terms of both binary assignment variables and real- 
valued atom coordinates. The encoding allows for an extensive search through all possible 
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5 



18 



26 



59 



65 



71 



ILEA 
1XGNB4 



TARQQEVFDLIRDH 
VAQARFLLAKIKRE 



PTRAEIAQRLGFRSPNAAEEHLKALARKGVIEIV 
FAYRWLQN-D-M-PEGQLKLALKTLEKAGAIYGY 



-GIRLLQE 
IYMYVRDV 



216 



229 



235 



265 



206 



212 



Figure 5: Alignment of ILEA against the Entrez-Mmdb domain 1XGNb4. The '*' denotes atoms 
also aligned by Vast. 1XGNb4 is a circularly permuted version of ILEA and our method finds this 
and aligns the sheet at the end of ILEA with the sheet at the beginning of the domain in IXGNb. 



alignments, including the ones involving arbitrary permutations. 

The error function is efficiently minimized using a mean field approximation of the assign- 
ment variables and exact translation/rotation of the atom coordinates. As a by-product 
of this approximation, a probabilistic interpretation of the result is available without 
tedious stochastic simulations. The approach is not sensitive to the choice of distance 
metric, and hence to a large extent ignores outliers. 

Despite some conceptual similarities with the iterative dynamical programming method 
P|, our approach is probabilistic and more general. Also, and maybe more importantly, 
it is quite different since permutations are allowed from the outset. For the latter reason, 
the algorithm in H cannot be derived as a special case in any limit. 

The method is readily extended to handle more detailed chain representations (e.g. side- 
chain orientation) and user-provided constraints of almost any kind. 

The approach is evaluated using pairs of protein chains chosen to represent a wide variety 
of situations and the resulting alignments are successfully compared with other methods 
that are available on WWW-servers. This evaluation is done using (^-representations 
of the chains. 

Despite being very flexible, generic and covering the entire space of alignments the 
method is on the average as fast as ||, slightly slower than || and significantly faster 
than [H. Also, it is very robust with respect to the algorithmic parameters used with a 
few exceptions. Once side-chains are included, the latter will disappear. 
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Biological Implications 



There is a strong need for efficient protein structure alignment algorithms. Aligning 
proteins forms the basis for studying functional relationships among proteins and con- 
struction of phylogenetic trees. It is also very important for structure prediction. 
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